IPL Data Analysis
Introduction
Sports analytics is practiced across all types of games around the world. It not only improves predictions about a game but also helps in analyzing team and individual player performance, which a team can use to improve and move toward winning. In this analysis we look at cricket, one of the most popular sports, using data from the Indian Premier League (IPL).
Loading required packages for the project
# loading required packages
library(lubridate)
library(tidyverse)
library(gapminder)
library(ggplot2)
library(knitr)
library(skimr)
library(dplyr)
library(ggthemes)
library(data.table)
library(reshape)
library(insight)
library(stringr)
library(plotly)
Explaining the data set
Overview Dataset
Matches Dataset
Deliveries Dataset
Deliveries2 Dataset
Loading the data sets into R
# reading all the csv files from the data sets
overview <- read.csv("overview.csv",na.strings=c("","NA"))
matches <- read.csv("matches.csv",na.strings=c("","NA"))
deliveries <- read.csv("deliveries.csv",na.strings=c("","NA"))
deliveries2 <- read.csv("deliveries2.csv",na.strings=c("","NA"))
Our data sets focus on cricket, so we want to explore them in a way that lets us make more in-depth observations about the variables and any missing values. We chose the matches data set for this, primarily because it contains the key labels we need to combine to show which teams won more games.
Exploring Important Variables in the data sets
Match Dataset
# matches data set
glimpse(matches)
## Rows: 756
## Columns: 18
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ season <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ city <chr> "Hyderabad", "Pune", "Rajkot", "Indore", "Bangalore", …
## $ date <chr> "2017-04-05", "2017-04-06", "2017-04-07", "2017-04-08"…
## $ team1 <chr> "Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lion…
## $ team2 <chr> "Royal Challengers Bangalore", "Rising Pune Supergiant…
## $ toss_winner <chr> "Royal Challengers Bangalore", "Rising Pune Supergiant…
## $ toss_decision <chr> "field", "field", "field", "field", "bat", "field", "f…
## $ result <chr> "normal", "normal", "normal", "normal", "normal", "nor…
## $ dl_applied <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ winner <chr> "Sunrisers Hyderabad", "Rising Pune Supergiant", "Kolk…
## $ win_by_runs <int> 35, 0, 0, 0, 15, 0, 0, 0, 97, 0, 0, 0, 0, 17, 51, 0, 2…
## $ win_by_wickets <int> 0, 7, 10, 6, 0, 9, 4, 8, 0, 4, 8, 4, 7, 0, 0, 6, 0, 4,…
## $ player_of_match <chr> "Yuvraj Singh", "SPD Smith", "CA Lynn", "GJ Maxwell", …
## $ venue <chr> "Rajiv Gandhi International Stadium, Uppal", "Maharash…
## $ umpire1 <chr> "AY Dandekar", "A Nand Kishore", "Nitin Menon", "AK Ch…
## $ umpire2 <chr> "NJ Llong", "S Ravi", "CK Nandan", "C Shamshuddin", NA…
## $ umpire3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
skim(matches)
| Name | matches |
| Number of rows | 756 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 7 | 0.99 | 4 | 14 | 0 | 32 | 0 |
| date | 0 | 1.00 | 8 | 10 | 0 | 546 | 0 |
| team1 | 0 | 1.00 | 13 | 27 | 0 | 15 | 0 |
| team2 | 0 | 1.00 | 13 | 27 | 0 | 15 | 0 |
| toss_winner | 0 | 1.00 | 13 | 27 | 0 | 15 | 0 |
| toss_decision | 0 | 1.00 | 3 | 5 | 0 | 2 | 0 |
| result | 0 | 1.00 | 3 | 9 | 0 | 3 | 0 |
| winner | 4 | 0.99 | 13 | 27 | 0 | 15 | 0 |
| player_of_match | 4 | 0.99 | 5 | 17 | 0 | 226 | 0 |
| venue | 0 | 1.00 | 8 | 52 | 0 | 41 | 0 |
| umpire1 | 2 | 1.00 | 5 | 21 | 0 | 61 | 0 |
| umpire2 | 2 | 1.00 | 5 | 21 | 0 | 65 | 0 |
| umpire3 | 637 | 0.16 | 6 | 23 | 0 | 25 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1792.18 | 3464.48 | 1 | 189.75 | 378.5 | 567.25 | 11415 | ▇▁▁▁▁ |
| season | 0 | 1 | 2013.44 | 3.37 | 2008 | 2011.00 | 2013.0 | 2016.00 | 2019 | ▇▆▆▅▇ |
| dl_applied | 0 | 1 | 0.03 | 0.16 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| win_by_runs | 0 | 1 | 13.28 | 23.47 | 0 | 0.00 | 0.0 | 19.00 | 146 | ▇▁▁▁▁ |
| win_by_wickets | 0 | 1 | 3.35 | 3.39 | 0 | 0.00 | 4.0 | 6.00 | 10 | ▇▁▃▃▁ |
Delivery Dataset
# deliveries data set
glimpse(deliveries)
## Rows: 179,078
## Columns: 21
## $ match_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ inning <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ batting_team <chr> "Sunrisers Hyderabad", "Sunrisers Hyderabad", "Sunris…
## $ bowling_team <chr> "Royal Challengers Bangalore", "Royal Challengers Ban…
## $ over <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ ball <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4,…
## $ batsman <chr> "DA Warner", "DA Warner", "DA Warner", "DA Warner", "…
## $ non_striker <chr> "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "S Dh…
## $ bowler <chr> "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS M…
## $ is_super_over <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wide_runs <int> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bye_runs <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ legbye_runs <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ noball_runs <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ penalty_runs <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ batsman_runs <int> 0, 0, 4, 0, 0, 0, 0, 1, 4, 0, 6, 0, 0, 4, 1, 0, 0, 3,…
## $ extra_runs <int> 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_runs <int> 0, 0, 4, 0, 2, 0, 1, 1, 4, 1, 6, 0, 0, 4, 1, 0, 0, 3,…
## $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DA Warne…
## $ dismissal_kind <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "caught",…
## $ fielder <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Mandeep …
skim(deliveries)
| Name | deliveries |
| Number of rows | 179078 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| batting_team | 0 | 1.00 | 13 | 27 | 0 | 15 | 0 |
| bowling_team | 0 | 1.00 | 13 | 27 | 0 | 15 | 0 |
| batsman | 0 | 1.00 | 5 | 20 | 0 | 516 | 0 |
| non_striker | 0 | 1.00 | 5 | 20 | 0 | 511 | 0 |
| bowler | 0 | 1.00 | 5 | 17 | 0 | 405 | 0 |
| player_dismissed | 170244 | 0.05 | 5 | 20 | 0 | 487 | 0 |
| dismissal_kind | 170244 | 0.05 | 3 | 21 | 0 | 9 | 0 |
| fielder | 172630 | 0.04 | 5 | 21 | 0 | 499 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match_id | 0 | 1 | 1802.25 | 3472.32 | 1 | 190 | 379 | 567 | 11415 | ▇▁▁▁▁ |
| inning | 0 | 1 | 1.48 | 0.50 | 1 | 1 | 1 | 2 | 5 | ▇▇▁▁▁ |
| over | 0 | 1 | 10.16 | 5.68 | 1 | 5 | 10 | 15 | 20 | ▇▇▇▇▇ |
| ball | 0 | 1 | 3.62 | 1.81 | 1 | 2 | 4 | 5 | 9 | ▇▇▃▅▁ |
| is_super_over | 0 | 1 | 0.00 | 0.02 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| wide_runs | 0 | 1 | 0.04 | 0.25 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| bye_runs | 0 | 1 | 0.00 | 0.12 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁ |
| legbye_runs | 0 | 1 | 0.02 | 0.19 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| noball_runs | 0 | 1 | 0.00 | 0.07 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| penalty_runs | 0 | 1 | 0.00 | 0.02 | 0 | 0 | 0 | 0 | 5 | ▇▁▁▁▁ |
| batsman_runs | 0 | 1 | 1.25 | 1.61 | 0 | 0 | 1 | 1 | 7 | ▇▁▁▁▁ |
| extra_runs | 0 | 1 | 0.07 | 0.34 | 0 | 0 | 0 | 0 | 7 | ▇▁▁▁▁ |
| total_runs | 0 | 1 | 1.31 | 1.61 | 0 | 0 | 1 | 1 | 10 | ▇▁▁▁▁ |
Missing values, Lubridate, Stringr & Summary Statistics
Dealing with missing values
Based on the individual reports above, the matches data set has some missing values. Most columns are missing only a handful of entries, but umpire3 is missing in 637 of the 756 rows. We decided to omit missing values and created a new data set by removing the rows that contain NA, as shown below.
# omit missing values for matches dataset
matches_o <- na.omit(matches)
After removing the NA values, we again use the glimpse and skim functions to inspect the new data set. Because na.omit() drops every row that has an NA in any column, 118 of the 756 rows remain, all from the 2018 and 2019 seasons.
glimpse(matches_o)
## Rows: 118
## Columns: 18
## $ id <int> 7894, 7895, 7896, 7897, 7898, 7899, 7900, 7901, 7902, …
## $ season <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ city <chr> "Mumbai", "Mohali", "Kolkata", "Hyderabad", "Chennai",…
## $ date <chr> "07/04/18", "08/04/18", "08/04/18", "09/04/18", "10/04…
## $ team1 <chr> "Mumbai Indians", "Delhi Daredevils", "Royal Challenge…
## $ team2 <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ toss_winner <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ toss_decision <chr> "field", "field", "field", "field", "field", "field", …
## $ result <chr> "normal", "normal", "normal", "normal", "normal", "nor…
## $ dl_applied <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ winner <chr> "Chennai Super Kings", "Kings XI Punjab", "Kolkata Kni…
## $ win_by_runs <int> 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 19, 4, 71, 46, 0, 15, 6…
## $ win_by_wickets <int> 1, 6, 4, 9, 5, 0, 1, 4, 7, 5, 0, 0, 0, 0, 7, 0, 0, 9, …
## $ player_of_match <chr> "DJ Bravo", "KL Rahul", "SP Narine", "S Dhawan", "SW B…
## $ venue <chr> "Wankhede Stadium", "Punjab Cricket Association IS Bin…
## $ umpire1 <chr> "Chris Gaffaney", "Rod Tucker", "C Shamshuddin", "Nige…
## $ umpire2 <chr> "A Nanda Kishore", "K Ananthapadmanabhan", "A.D Deshmu…
## $ umpire3 <chr> "Anil Chaudhary", "Nitin Menon", "S Ravi", "O Nandan",…
skim(matches_o)
| Name | matches_o |
| Number of rows | 118 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 0 | 1 | 4 | 13 | 0 | 11 | 0 |
| date | 0 | 1 | 8 | 8 | 0 | 94 | 0 |
| team1 | 0 | 1 | 14 | 27 | 0 | 9 | 0 |
| team2 | 0 | 1 | 14 | 27 | 0 | 9 | 0 |
| toss_winner | 0 | 1 | 14 | 27 | 0 | 9 | 0 |
| toss_decision | 0 | 1 | 3 | 5 | 0 | 2 | 0 |
| result | 0 | 1 | 3 | 6 | 0 | 2 | 0 |
| winner | 0 | 1 | 14 | 27 | 0 | 9 | 0 |
| player_of_match | 0 | 1 | 6 | 15 | 0 | 61 | 0 |
| venue | 0 | 1 | 12 | 52 | 0 | 16 | 0 |
| umpire1 | 0 | 1 | 6 | 21 | 0 | 21 | 0 |
| umpire2 | 0 | 1 | 6 | 21 | 0 | 23 | 0 |
| umpire3 | 0 | 1 | 6 | 23 | 0 | 25 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 9572.61 | 1685.65 | 7894 | 7923.25 | 7952.5 | 11319.75 | 11415 | ▇▁▁▁▇ |
| season | 0 | 1 | 2018.49 | 0.50 | 2018 | 2018.00 | 2018.0 | 2019.00 | 2019 | ▇▁▁▁▇ |
| dl_applied | 0 | 1 | 0.03 | 0.16 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| win_by_runs | 0 | 1 | 11.36 | 21.09 | 0 | 0.00 | 0.0 | 14.00 | 118 | ▇▁▁▁▁ |
| win_by_wickets | 0 | 1 | 3.27 | 3.23 | 0 | 0.00 | 4.0 | 6.00 | 10 | ▇▂▅▂▁ |
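Since na.omit() works row-wise across all columns, a sparsely populated column such as umpire3 decides how many rows survive. The following is a minimal sketch, on a made-up toy data frame (not the real matches file), of one possible alternative: keeping rows that are complete only in the columns an analysis actually needs. The column choice in `needed` is an assumption for illustration.

```r
# Toy stand-in for the matches data (values are illustrative, not from the real file)
toy <- data.frame(
  id      = 1:5,
  winner  = c("A", "B", NA, "A", "B"),
  umpire3 = c(NA, NA, NA, "X", NA),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has an NA in any column
all_cols <- na.omit(toy)

# Alternative: require completeness only in the columns the analysis uses
needed    <- c("id", "winner")   # hypothetical choice of columns
some_cols <- toy[complete.cases(toy[, needed]), ]

nrow(all_cols)   # rows left after na.omit()
nrow(some_cols)  # rows left when only `needed` columns must be complete
```

On the toy data the first approach keeps one row and the second keeps four, which mirrors how a single mostly-empty column can dominate the result.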
Using lubridate to create a new column for the day
# Using wday() from the lubridate library to create a new column for the day of the week each game was played
# Note: the date column mixes formats (e.g. "2017-04-05" and "07/04/18"), so as_date() may not parse every row as intended
matches$Day <- wday(as_date(matches$date))
Using stringr to count player-of-the-match awards
# Using str_count we will be checking number of times player of the match repeated
p_o_m <- str_count(matches_o$player_of_match, "S Dhawan")
p_o_m_count <- sum(p_o_m)
p_o_m_count
## [1] 4
We wanted to check how many times the player “S Dhawan” won “player of match”. In the matches_o data set, the answer is 4.
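A pattern search such as the one above matches substrings, so it would also count any longer name that happens to contain "S Dhawan". Whether such a name occurs in this data set is not checked here; the sketch below uses a hypothetical award list and base R only (grepl with fixed = TRUE stands in for the substring search) to show the difference from an exact comparison.

```r
# Hypothetical award list; the names are illustrative only
awards <- c("S Dhawan", "RS Dhawan", "KL Rahul", "S Dhawan")

# Substring matching also counts "RS Dhawan"
substring_hits <- sum(grepl("S Dhawan", awards, fixed = TRUE))

# Exact comparison counts only the intended player
exact_hits <- sum(awards == "S Dhawan", na.rm = TRUE)

substring_hits
exact_hits
```

When the goal is to count awards for one specific player, the exact comparison is usually the safer choice.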
Summary statistics for two quantitative variables
Statistics for the win_by_runs grouping with variable city
# summary statistics for the win_by_runs in each city.
data.table(matches_o)[, as.list(summary(win_by_runs)), by="city"]
## city Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1: Mumbai 0 0 0 10.437500 17.50 46
## 2: Mohali 0 0 0 4.500000 10.00 15
## 3: Kolkata 0 0 0 17.750000 25.75 102
## 4: Hyderabad 0 0 1 17.666667 26.00 118
## 5: Chennai 0 0 0 17.333333 22.00 80
## 6: Jaipur 0 0 0 5.714286 10.75 30
## 7: Bengaluru 0 0 0 5.461538 14.00 19
## 8: Pune 0 0 0 12.833333 9.75 64
## 9: Delhi 0 0 2 11.714286 14.75 55
## 10: Indore 0 0 0 7.750000 7.75 31
## 11: Visakhapatnam 0 0 0 0.000000 0.00 0
Statistics for the win_by_wickets grouping with variable team1
#summary statistics for the win_by_wickets for each team.
data.table(matches_o)[, as.list(summary(win_by_wickets)), by="team1"]
## team1 Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1: Mumbai Indians 0 0 0.0 1.894737 3.50 8
## 2: Delhi Daredevils 0 0 5.0 3.666667 6.00 9
## 3: Royal Challengers Bangalore 0 0 4.5 3.500000 5.75 7
## 4: Rajasthan Royals 0 0 5.0 4.307692 6.00 9
## 5: Kolkata Knight Riders 0 0 5.0 3.933333 7.00 9
## 6: Kings XI Punjab 0 0 3.5 3.428571 5.75 10
## 7: Chennai Super Kings 0 0 2.0 3.000000 6.00 8
## 8: Sunrisers Hyderabad 0 0 3.0 3.250000 6.00 8
## 9: Delhi Capitals 0 0 2.5 2.833333 5.75 6
Frequency table for two categorical variables
# Generating the frequency table using table function
freq_table <- table(matches_o$winner, matches_o$toss_decision)
freq_table
##
## bat field
## Chennai Super Kings 2 19
## Delhi Capitals 2 7
## Delhi Daredevils 1 4
## Kings XI Punjab 1 11
## Kolkata Knight Riders 1 14
## Mumbai Indians 4 13
## Rajasthan Royals 4 8
## Royal Challengers Bangalore 0 11
## Sunrisers Hyderabad 5 11
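Raw counts like these are easier to compare across teams when converted to row-wise shares. A minimal sketch, using a small made-up data frame rather than matches_o, of how prop.table() turns a two-way frequency table into proportions within each winner:

```r
# Toy data; team names and decisions are illustrative only
toy <- data.frame(
  winner        = c("A", "A", "B", "B", "B"),
  toss_decision = c("bat", "field", "field", "field", "bat")
)

# Two-way frequency table, as in the analysis above
tab <- table(toy$winner, toy$toss_decision)

# Share of each toss decision within each winning team (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
round(row_share, 2)
```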
Next, we split the “toss_decision” column into separate “field” and “bat” columns so that each match shows the toss winner, the decision they took, and the number of runs by which the match was won.
By using pivot_wider
#created a pivot-wider for toss_decision
decision_wider <- matches_o %>%
pivot_wider(id_cols= id:toss_winner,
names_from = toss_decision,
values_from = win_by_runs,
values_fill = 0)
decision_wider
## # A tibble: 118 × 9
## id season city date team1 team2 toss_…¹ field bat
## <int> <int> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 7894 2018 Mumbai 07/04/18 Mumbai Indians Chen… Chenna… 0 0
## 2 7895 2018 Mohali 08/04/18 Delhi Daredevils King… Kings … 0 0
## 3 7896 2018 Kolkata 08/04/18 Royal Challengers … Kolk… Kolkat… 0 0
## 4 7897 2018 Hyderabad 09/04/18 Rajasthan Royals Sunr… Sunris… 0 0
## 5 7898 2018 Chennai 10/04/18 Kolkata Knight Rid… Chen… Chenna… 0 0
## 6 7899 2018 Jaipur 11/04/18 Rajasthan Royals Delh… Delhi … 10 0
## 7 7900 2018 Hyderabad 12/04/18 Mumbai Indians Sunr… Sunris… 0 0
## 8 7901 2018 Bengaluru 13/04/18 Kings XI Punjab Roya… Royal … 0 0
## 9 7902 2018 Mumbai 14/04/18 Mumbai Indians Delh… Delhi … 0 0
## 10 7903 2018 Kolkata 14/04/18 Kolkata Knight Rid… Sunr… Sunris… 0 0
## # … with 108 more rows, and abbreviated variable name ¹toss_winner
Data Dictionary
Create a data dictionary showcasing the variables used in your analyses
# create a new data with only one row for data dictionary
mat <- head(matches, 1)
del <- head(deliveries, 1)
# merging two different data into one dataset
Match_Del <- merge(mat,del)
# Extracting only required columns used in the analysis
Match_Del_new <- subset(Match_Del, select=c("win_by_runs", "city", "win_by_wickets", "team1", "team2","winner","toss_decision","toss_winner", "batsman_runs"))
# Creating dictionary table for used variables
dataDictionary <- tibble(Variable = colnames(Match_Del_new),
Description = c("Winning run by batting team",
"Matches held in which city",
"Winning wickets by bowling team",
"Teams in Group 1", "Teams in Group 2",
"Name of the Winning Team",
"Decision taken by team either bat or field",
"Name of the team winning toss",
"Number of runs scored by each player"),
Type = map_chr(Match_Del_new, .f = function(x){typeof(x)[1]}))
knitr::kable(dataDictionary)
| Variable | Description | Type |
|---|---|---|
| win_by_runs | Winning run by batting team | integer |
| city | Matches held in which city | character |
| win_by_wickets | Winning wickets by bowling team | integer |
| team1 | Teams in Group 1 | character |
| team2 | Teams in Group 2 | character |
| winner | Name of the Winning Team | character |
| toss_decision | Decision taken by team either bat or field | character |
| toss_winner | Name of the team winning toss | character |
| batsman_runs | Number of runs scored by each player | integer |
Data Visualizations
Top 10 players with highest number of runs
Bar Chart
We have created an interactive bar chart using plotly.
For any sport, and cricket in particular, it is useful to look at the total runs scored by each batsman, which helps in comparing scoring across players. To do that, we use the deliveries data set.
# Creating new variable using the deliveries data set
Top_Batsman<- deliveries %>%
group_by(batsman)%>%
summarise(runs=sum(batsman_runs)) %>%
arrange((runs)) %>%
filter(runs > 3000)
# Creating new variable for top_10 batsman
Top_10_Batsman <- Top_Batsman %>%
top_n(n=10,wt=runs) %>%
ggplot(aes(reorder(batsman, -runs),runs,fill=batsman)) +
labs(title = "Top 10 Batsman with highest number of runs in IPL",
x= "Batsman",
y= "Runs",
caption = "Data source: IPL Dataset from Kaggle")+
scale_fill_viridis_d()+
geom_bar(stat = "identity")+
geom_text(aes(label = runs),
vjust = 0.5, size= 3) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 0.4),
legend.position = "none")
ggplotly(Top_10_Batsman)
From the above plot, we can say that the player “V Kohli” has the highest number of runs, 5434.
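The ranking step behind the chart is a group-wise sum followed by a sort and a cut-off. As a minimal sketch of that idea in base R, on a made-up toy table (the batsman labels and run values are hypothetical), the top scorers can be obtained without a fixed run threshold:

```r
# Toy ball-by-ball data; batsman labels and runs are illustrative only
toy_deliveries <- data.frame(
  batsman      = c("A", "A", "B", "B", "C", "C", "D"),
  batsman_runs = c(4, 6, 1, 1, 6, 6, 2)
)

# Total runs per batsman
totals <- aggregate(batsman_runs ~ batsman, data = toy_deliveries, FUN = sum)

# Sort in descending order of runs and keep the top three
totals <- totals[order(-totals$batsman_runs), ]
top3   <- head(totals, 3)
top3
```

Taking the top rows after sorting avoids relying on a hard-coded cut-off such as 3000 runs, which only works while at least ten players exceed it.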
Total number of matches in each city
Line Chart
# Creating new dataframe for the line chart
matches_cities <- matches %>% select(id:winner)%>%
group_by(city) %>%
summarise(Total= n())
# Generating new plot using the above data frame
Different_cities<- matches_cities %>%
filter(!is.na(city)) %>%
ggplot()+
aes(x= city, y = Total, group= 1)+
geom_line(color = "#00abff")+
labs(title = "Number of Matches played in different cities",
x= "City",
y= "Total Matches",
caption = "Data source: IPL Dataset from Kaggle")+
scale_color_continuous()+
geom_text(aes(label = Total),
vjust = -0.125) +
theme_bw()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
Different_cities
The line chart above shows that the most matches were held in Mumbai (101) and the fewest in Bloemfontein (2).
Teams which have won the highest number of toss
Pie-Chart
# creating a new variable matches_p to get the highest number of toss winner w.r.t to teams
matches_p <- matches %>%
group_by(toss_winner)%>%
summarise(Percentage= n())
matches_p$Percentage <- round(matches_p$Percentage/sum(matches_p$Percentage)*100, digits = 1)
matches_p_new <- matches_p %>%
top_n(n=10, wt= Percentage)
# Generating a pie chart of the teams with the highest percentage of toss wins
matches_p_new %>% ggplot()+
aes(x = "", y = -Percentage,fill = reorder(toss_winner, -Percentage)) +
geom_bar(stat = "identity", width= 1, color = "black") +
labs(title = "Team with highest toss winning (%)",
caption = "Data source: IPL Dataset from Kaggle",
fill ="Winning Teams") +
coord_polar("y", start = 0) +
theme_void()+
geom_text(aes(label = Percentage), position = position_stack(vjust = 0.5),
color = "black", size=2.9)+
scale_color_viridis_d()
The pie chart shows that the team “Mumbai Indians” has won the toss more often than any other team, while the team “Pune” has the lowest number of toss wins.
Merge at least two tables, and create a plot or table of summary statistics that is a result of the merged data set
Stacked Bar-chart with Line Graph
# creating two tables from the matches data set
matches_won<-as.data.frame(table(matches$winner))
matches_played<-as.data.frame(table(matches$team2) + table(matches$team1))
# Re-writing the column names for the above data sets
colnames(matches_won) <- c('Team','Won')
colnames(matches_played) <- c('Team','Played')
# merging above two data sets with the function merge
matches_w_p <- merge(matches_won, matches_played)
matches_per <- matches_w_p %>%
group_by(Team, Won, Played)%>%
summarise(Win_Percent = round((Won/Played)*100, digits = 0))
matches_per_new <- as.data.frame(matches_per)
# Generating new plot with the merged data set using pivot_longer
# Stacked Bar chart with line graph on top
Stacked_Bar_Line <- matches_per_new %>% pivot_longer(Won:Played)%>%
ggplot(aes(x = Team)) +
geom_bar(stat = "identity", aes(y = value,fill = name))+
geom_line(aes(y = 3*Win_Percent), size = 0.5, color="red", group = 1)+
geom_text(position=position_stack(vjust = .5),
aes(x = Team, y = value, label = value), size= 3)+
scale_y_continuous(
name = "Won & Played",
breaks = seq (0, 300, 50),
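# the line is drawn at 3 * Win_Percent, so the secondary axis divides by 3 to read back percentages
sec.axis = sec_axis(~ . / 3, name = "Win Percentage %", breaks = seq(0, 100, 10)))+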
labs(title = "Total number of Matches Played vs Won by each team",
x= "Teams",
y= "Count",
fill= "",
caption = "Data source: IPL Dataset from Kaggle")+
scale_fill_manual(values = c("grey47", "grey"))+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.50, size = 5.2),
legend.position = "right")
Stacked_Bar_Line
According to the stacked bar chart above, the Mumbai Indians have played the most matches and won the most games overall.
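The merge-then-divide step can be illustrated on a small scale. The sketch below uses two made-up tables (team labels and counts are hypothetical) and keeps teams that appear in the played table but have no wins, which a default merge() would drop; whether any such team exists in the real data is not checked here.

```r
# Toy tables; team labels and counts are illustrative only
won    <- data.frame(Team = c("A", "B"), Won = c(6, 3))
played <- data.frame(Team = c("A", "B", "C"), Played = c(10, 12, 4))

# Keep every team that played, even if it has no row in `won`
merged <- merge(won, played, by = "Team", all.y = TRUE)
merged$Won[is.na(merged$Won)] <- 0

# Win percentage, rounded to whole numbers
merged$Win_Percent <- round(100 * merged$Won / merged$Played, digits = 0)
merged
```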
Count for the number of 50s and 100s in IPL
Histogram
# creating histogram to check the 50s and 100s in IPL
Hist_graph <- hist(matches$win_by_runs,
main="Maximum number of 50s and 100s in IPL game",
xlab="Number of runs in IPL",
ylab= "Frequency of runs",
col = "darkslategray1")
text(Hist_graph$mids,Hist_graph$counts,labels=Hist_graph$counts, adj=c(0.5, -0.5))
As cricket fans, we were curious about the cumulative number of 50s and 100s throughout the IPL seasons. Note, however, that the histogram above plots win_by_runs, the margin of victory in runs, rather than individual batsman scores. It shows that most values fall between 0 and 10, in part because matches won by wickets are recorded with win_by_runs equal to 0.
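Counting 50s and 100s requires per-batsman totals within each match, which come from ball-by-ball data rather than from win_by_runs. The following is a rough sketch on a made-up toy table that borrows the match_id, batsman, and batsman_runs column names from the deliveries data set; it assumes each batsman bats once per match, which ignores super overs.

```r
# Toy ball-by-ball data; values are illustrative only
toy_deliveries <- data.frame(
  match_id     = c(rep(1, 14), rep(2, 30)),
  batsman      = c(rep("A", 9), rep("B", 5), rep("A", 17), rep("B", 13)),
  batsman_runs = c(rep(6, 9), rep(4, 5), rep(6, 17), rep(4, 13))
)

# Runs scored by each batsman in each match
innings <- aggregate(batsman_runs ~ match_id + batsman,
                     data = toy_deliveries, FUN = sum)

# A "50" is a score from 50 to 99; a "100" is a score of 100 or more
fifties  <- sum(innings$batsman_runs >= 50 & innings$batsman_runs < 100)
hundreds <- sum(innings$batsman_runs >= 100)

fifties
hundreds
```

A histogram of `innings$batsman_runs` would then show the distribution of individual scores, with the 50 and 100 marks visible as thresholds.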
BootStrap and Monte Carlo Simulation
Implement at least one permutation test based on a traditional hypothesis test, such as a two-sample t-test or a chi-squared test of independence, to test a hypothesis of interest for your data
# calculating the difference in samples test
x <- mean(matches_o$win_by_runs[matches_o$winner=="Sunrisers Hyderabad"])
y <- mean(matches_o$win_by_runs[matches_o$winner=="Chennai Super Kings"])
# calculating the absolute mean value
t_sam <- abs(mean(matches_o$win_by_runs[matches_o$winner=="Sunrisers Hyderabad"])-
mean(matches_o$win_by_runs[matches_o$winner=="Chennai Super Kings"]))
# observations of sample
n <- length(matches_o$winner)
# number of permutations
p <- 100
variable <- matches_o$win_by_runs
# Permutation Samples
PermSamp <- matrix(0, nrow= n, ncol = p)
# Recurring loop for the sample generator
P_S <- for (i in 1:p) {
PermSamp[,i] <- sample(variable, size=n, replace= FALSE)
}
Perm_t_sam <- rep(0,p)
# loop to calculate t-test
P_S_1 <- for (i in 1:p) {
Perm_t_sam[i] <- abs(mean(PermSamp[matches_o$winner=="Sunrisers Hyderabad",i])-
mean(PermSamp[matches_o$winner=="Chennai Super Kings",i]))
}
# Percentage of permuted test values that are at least as large as the observed test value
Hypothesis_value <- mean((Perm_t_sam >= t_sam)[1:15])*100
Our hypothesis was that only a small percentage of the permuted test statistics would be at least as large as the observed one. However, 46.7% of the first 15 permutations gave a value at least as large as the observed test statistic. With such a large proportion, we fail to reject the null hypothesis of no difference in mean win_by_runs between the two teams, so the data do not support our alternative hypothesis.
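For comparison, a two-sample permutation test is often written by shuffling the group labels and using every permutation when computing the p-value. The sketch below is one such version under stated assumptions: the margins and team labels are made up, the number of permutations `P` is an arbitrary choice, and the "+ 1" correction is a common convention rather than something taken from the analysis above.

```r
set.seed(1)

# Hypothetical winning margins for two teams (illustrative values only)
margins <- c(35, 15, 97, 17, 51, 0, 4, 10, 19, 2)
team    <- rep(c("X", "Y"), each = 5)

# Observed absolute difference in group means
obs <- abs(mean(margins[team == "X"]) - mean(margins[team == "Y"]))

# Shuffle the labels P times and recompute the statistic each time
P <- 2000
perm <- replicate(P, {
  shuffled <- sample(team)
  abs(mean(margins[shuffled == "X"]) - mean(margins[shuffled == "Y"]))
})

# Share of permuted statistics at least as large as the observed one
p_value <- (sum(perm >= obs) + 1) / (P + 1)
p_value
```

A small p-value would suggest the observed difference is unlikely under random labelling; a large one, as in the analysis above, gives no evidence of a difference.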
Obtain a parametric and nonparametric bootstrap-estimated standard error for at least one statistic of interest
Non-Parametric Bootstrap: estimated standard error of the sample median
# Setting the seed so the results are reproducible
set.seed(1989)
n<- 30
# Taking win_by_wickets as the observed sample
observed <- matches$win_by_wickets
# Sample median
median(observed)
## [1] 4
# Number of bootstrap samples
B<-10000
# Instantiating matrix for bootstrap samples
boots <- matrix (NA, nrow=n, ncol=B)
# Sampling with replacement B times
# Note: because n is 30, sample(1:n, ...) draws indices only from the first 30 observations
for(b in 1:B) {
boots[, b] <- observed[sample(1:n, size= n, replace = TRUE)]
}
#Instantiating vector for bootstrap medians
bootMedians <- vector(length= B)
# Note: this loop repeats the resampling above and overwrites boots with fresh draws
for (b in 1:B) {
boots[, b] <- observed [sample(1:n, size = n, replace = TRUE)]
}
# Instantiating vector for bootstrap medians
bootMedians <- vector (length = B)
# Calculating the median of each bootstrap sample
for (b in 1:B) {
bootMedians [b] <- median (boots [, b])
}
# Nonparametric estimate of the SE of the sample median
SEestimate <- sd (bootMedians)
SEestimate
## [1] 1.83859
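A more compact way to express the same idea is to resample from the whole observed vector and take the standard deviation of the resampled medians. This is a minimal sketch, not the analysis above: the input is just the first 18 win_by_wickets values visible in the glimpse output, and `B` is an arbitrary choice, so the resulting number is not comparable to the reported estimate.

```r
set.seed(1989)

# First 18 win_by_wickets values shown in the glimpse output (illustration only)
observed <- c(0, 7, 10, 6, 0, 9, 4, 8, 0, 4, 8, 4, 7, 0, 0, 6, 0, 4)
n <- length(observed)

# Number of bootstrap samples (arbitrary choice for this sketch)
B <- 2000

# Resample all n observations with replacement and record each median
boot_medians <- replicate(B, median(sample(observed, size = n, replace = TRUE)))

# Nonparametric bootstrap estimate of the standard error of the median
se_hat <- sd(boot_medians)
se_hat
```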
Parametric Bootstrap: estimated standard error of the sample median
# Number of bootstrap samples
B <- 10000
#Instantiating matrix for bootstrap samples
paramBoots <- matrix(NA, nrow = n, ncol = B)
XBar <- mean(observed)
s <- sd(observed)
# Simulating a normal set of n values, B times
for(b in 1:B){
paramBoots[, b] <- rnorm(n = n, mean = XBar, sd = s)
}
# Instantiating vector for bootstrap medians
bootParamMedians <- vector(length = B)
#Calculating median for each simulated data set
for(b in 1:B) {
bootParamMedians[b] <- median(paramBoots[, b])
}
# Parametric estimate of the SE of the sample median
SEparamEstimate <- sd(bootParamMedians)
SEparamEstimate
## [1] 0.7585229
Collaboration
Division of Work
| Team Member | Contribution |
|---|---|
| Maruthi Sai Phani Teja Chadalapaka | Data Visualizations(Line graph, Stacked Bar chart with line graph), Bootstrap and Monte Carlo Simulation, Summary Statistics, Data Dictionary, Dashboard Creation(Flex Dashboard). |
| Mounika Balreddyguda | Introduction, Explaining Datasets, Missing Values, Lubridate, Stringr, Data Visualization( Pie-chart, Interactive Bar-Chart, histogram) |
Dashboard Link
This dashboard provides visualizations of the top batsmen and the highest toss-winning rates, along with a runs summary table grouped by city.
Dashboard Link: http://rpubs.com/Maruthi_17/982174